Add RCU read bindings (#143) and each_process() iterator #250

geofft · 2020-08-17T04:49:19Z

Add an RcuReadGuard object for calling functions that expect to be
RCU-protected, and add a Rust equivalent of for_each_process and very
light task_struct bindings sufficient to write a simple test/demo.

geofft · 2020-08-17T04:52:43Z

Obviously the test case needs work but I'd like some feedback on this design. I'm liking it, but at this point I've made myself familiar with how RCU works :)

nelhage · 2020-08-17T05:40:28Z

tests/for-each-process/src/lib.rs

+    ) -> linux_kernel_module::KernelResult<()> {
+        let g = rcu::RcuReadGuard::new();
+        for mut p in sched::each_process(&g) {
+            buf.write(&p.comm())?;


Writing to a user pointer can sleep, right, so this is incorrect, since you may sleep during an RCU read-side critical section?

Yeah I think you're right. How would you do this in C? Make a local variable, fill it in, and then copy_to_user after rcu_read_unlock?

I can do that, I'm not sure how to get Rust to catch it, and I'm curious if Sparse would have caught it.

Note that this is tricky, since in the kernel we usually want to avoid dynamically-sized allocation if we can (especially with lengths under user control), so copying the entire output into memory is frowned upon, but also, once we rcu_read_unlock, we can't guarantee our struct task_struct * linked list doesn't change out from under us.

In practice, something like this would probably be built on the seq_file abstraction. seq_file users have a "position" abstraction that they must use to track progress through their underlying data source, and in practice, the seq_file infrastructure implements a single-page bounce buffer that callers write into (one page at a time), and then seq_file copies that back to userspace.

But wait, you might ask, how do you use that position abstraction to keep your position in the task_struct linked list across each page of data? Do you have to get_task_struct the current finger into the list?

Well, no, and in fact that would be illegal, since it would allow userspace to construct a long-lived reference to a task by opening your file, advancing through it with small reads until it got to the point you wanted, and then just sitting on the fd. By sending that fd to another process you could construct a circular list of task references, which is no go.

So instead, you would just keep your cursor as an integer "index into the task list", and, each time you were called upon to return another page of data, you would start with init_task and advance N tasks forward and resume printing.

This is, you may note, both (a) quadratic, and (b) racy.

And yes, this is how almost every file in /proc works; If you read (e.g.) /proc/mounts and it takes more than a single read() call to return the full contents, you can miss mount entries, even ones that existed continually for the entire open/read/read/close cycle, if there are concurrent mounts and unmounts.

This was a fun day at Ksplice

I switched to printk, which one hopes is safe from RCU context. (Our println! is heap-free.)

Yeah, printk is safe in any context, even really hostile ones like NMIs or stop_machine

alex

Is there a good reference on RCUs to get up to speed?

@nelhage's comment leaves me a bit concerned that it's incredibly easy to accidentally violate the rules leading to deadlocks. Is that just life in C as well?

alex · 2020-08-17T12:13:14Z

src/sched.rs

+    /// typically the base name of the command, and does not have the
+    /// full path or arguments. It's a fixed-sized set of bytes, but by
+    /// convention it's interpreted as NUL-terminated.
+    pub fn comm(&mut self) -> Vec<u8> {


No practical way to return an &[u8]?

self.0.comm is only valid while we've got a lock on self.0, and I don't want to hold a lock for the lifetime of this structure.

We could probably expose the locking to Rust, i.e, something like

impl TaskStruct { fn lock(&mut self) -> LockedTaskStruct; } impl LockedTaskStruct { fn comm(&mut self) -> &[u8]; }

or perhaps even something like

impl TaskStruct { fn comm(&mut self, g: &mut LockGuard) -> &[u8]; }

though given the lack of NLL, I'm more hesitant about using an RAII guard to lock a task_struct than I am about making an RCU read-side critical section. An RcuReadGuard only prevents cleanup after making changes, it doesn't prevent a writer from doing the reading, copying, or updating.

Is there a reason not to return a [u8; TASK_COMM_LEN]? This is basically what get_task_comm does in C.

Hm, that would work. I was actually leaning towards taking an &mut [u8] but I guess yeah you can just return the array and make the compiler deal with it.

(This code was largely just "make anything work at all")

Switched it to returning a [u8; TASK_COMM_LEN].

It seems that many of my task structs have some garbage after the null byte. For instance, if in Python you open("/proc/self/comm").write("foo"), you'll find that your comm is now foo\0on\0\0\0... because it was previously python. I could return this directly, but because we support memory safety here on Team Rust, and because I had to do shenanigans to convert i8 -> u8 anyway, I opted to cut it off after the first NUL byte.

alex · 2020-08-17T12:13:19Z

src/sched.rs

+        unsafe {
+            task_lock_helper(self.0);
+            // if only char were unsigned char
+            let v = (&*(&(*self.0).comm[..] as *const _ as *const [u8])).to_vec();


It gets you an [i8; 16] (i.e. char[16]) and this seems to be the preferred non-transmute way to turn that into a [u8]. We should probably make a wrapper for this, this is sort of the job of OsStr or something.

This is less shenanigous now. (It's more verbose but I hope the optimizer knows what it's doing.)

src/sched.rs

tests/for-each-process/src/lib.rs

geofft · 2020-08-17T15:29:29Z

Is there a good reference on RCUs to get up to speed?

There's an overview and some links in #143.

I'm about 80% sure that "crossbeam, but you guarantee epochs end because the kernel has precise control over scheduling" is a good tl;dr.

The basic idea is it's an extremely reader-optimized reader-writer lock pattern. A "writer" can't modify an RCU-protected structure directly. Instead, it makes a copy of the thing it wants to modify, updates that, atomically updates a pointer, waits for all readers of the old thing to be done, and then frees the old thing. If you know that readers exit quickly, then you can do this with almost no overhead on the reader side and without having an actual lock in memory to point to.

So it's mostly useful for protecting linked lists and other structures, like the list of processes in this example. It does not protect data that isn't behind a pointer, which means A) it's not a tool for multiple writers to coordinate (you want a real lock for that) and B) as in this example, actually accessing data inside a specific item in the linked list in a race-free way usually requires taking a real lock.

nelhage · 2020-08-17T15:41:00Z

@nelhage's comment leaves me a bit concerned that it's incredibly easy to accidentally violate the rules leading to deadlocks. Is that just life in C as well?

This is correct. The kernel rules about (a) when it's safe to schedule(), and (b) what operations will or won't schedule() are subtle and very easy to violate. And, of course, not uniformly documented, just a matter of oral tradition and lore and smatterings of references in different documents.

geofft · 2020-08-17T15:50:28Z

I think this is an equivalent problem to only calling async-signal-safe functions from a signal handler, or only calling kmalloc(GFP_NOWAIT) instead of kmalloc(GFP_KERNEL) in an interrupt. It seems like it's theoretically possible for Rust to enforce this by putting annotations on functions, I just don't know how.

geofft · 2020-08-17T21:53:54Z

https://github.com/japaric/cargo-call-stack might be some inspiration for "you can't call function Y from function X". It seems to work by parsing LLVM IR (it has its own parser) and constructing a call graph.

geofft · 2020-08-21T14:33:36Z

I think this is ready to land - any concerns about the current API?

alex · 2020-08-21T14:35:50Z

Only that I need to read all the RCU docs and think hard :-)

…

On Fri, Aug 21, 2020 at 10:33 AM Geoffrey Thomas ***@***.***> wrote: I think this is ready to land - any concerns about the current API? — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#250 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAAGBBCLNQJJZZ3LCUOH4LSB2AU7ANCNFSM4QBHQJIA> .

-- All that is necessary for evil to succeed is for good people to do nothing.

alex · 2020-08-21T23:13:41Z

src/sched.rs

+                Some(lvalue) => {
+                    *lvalue = *byte as _;
+                }
+                None => {


Is this even reachable? Can you zip .comm and result together?

Hopefully not?

Probably? I mean really I want to just call copy_from_slice but there's no way to do that safely while casting.

FWIW, this code gets much simpler if we're using ArrayVec :-)

alex · 2020-08-21T23:14:20Z

src/sched.rs

+    _g: &'g rcu::RcuReadGuard,
+}
+
+pub fn each_process(g: &rcu::RcuReadGuard) -> EachProcess {


Should this return impl Iterator<Item=TaskStruct>>? no reason to have this type in the public API, right?

src/sched.rs

alex · 2020-08-21T23:16:52Z

tests/for-each-process/src/lib.rs

+        let g = rcu::RcuReadGuard::new();
+        for mut p in sched::each_process(&g) {
+            let comm = p.comm();
+            let comm_until_nul = comm.split(|c| *c == 0).next().unwrap();


Will there be a null if the comm name uses up the full buffer?

Yes, the buffer as returned by the kernel always includes at least one trailing NUL, but more importantly, split() always returns at least one element so we're fine either way.

tests/for-each-process/tests/tests.rs

alex · 2020-08-22T00:16:17Z

src/sched.rs

+    /// typically the base name of the command, and does not have the
+    /// full path or arguments. It's a fixed-sized set of bytes, but by
+    /// convention it's interpreted as NUL-terminated.
+    pub fn comm(&mut self) -> [u8; bindings::TASK_COMM_LEN as usize] {


On thinking more... do we need a type that's backed by a fixed size array, but maintains a "how much of this is valid" usize and derefs to slice/mut slice?

Yes. smallvec is basically this except that it falls back to Vec; I think we want one that just doesn't fall back.

While we're at it I also want to bury the i8/u8 nonsense under an abstraction too (i.e. impl Deref<Target=[u8]> for something constructed from an &[i8]). Really what I think I want is OsStr from libstd :(

https://crates.io/crates/arrayvec

Add an RcuReadGuard object for calling functions that expect to be RCU-protected, and add a Rust equivalent of for_each_process and a very light task_struct bindings sufficient to write a simple test/demo.

fbq · 2020-08-27T15:13:54Z

src/sched.rs

+        // - next_task calls rcu_dereference internally, which is safe
+        //   because we hold self._g.
+        // - The returned reference has lifetime 'g, which is valid
+        //   because self._g lives at least that long.


Please note that holding RCU doesn't guarantee the task list to be unchanged, RCU read-side critical section here only guarantees that the task_struct we reference here won't get freed (in side the RCU read-side critical section). We use another lock "tasklist_lock" to protect the tasklist.

geofft requested a review from alex August 17, 2020 04:51

geofft mentioned this pull request Aug 17, 2020

Add RCU bindings #143

Open

nelhage reviewed Aug 17, 2020

View reviewed changes

alex reviewed Aug 17, 2020

View reviewed changes

geofft commented Aug 17, 2020

View reviewed changes

src/sched.rs Show resolved Hide resolved

src/sched.rs Outdated Show resolved Hide resolved

tests/for-each-process/src/lib.rs Outdated Show resolved Hide resolved

geofft force-pushed the rcu branch from 9e7eba3 to a2e3469 Compare August 19, 2020 00:04

geofft changed the title ~~WIP: RCU~~ Add RCU read bindings (#143) and each_process() iterator Aug 19, 2020

geofft marked this pull request as ready for review August 19, 2020 00:18

geofft force-pushed the rcu branch from a2e3469 to 10f69c7 Compare August 19, 2020 00:20

alex reviewed Aug 21, 2020

View reviewed changes

alex reviewed Aug 22, 2020

View reviewed changes

Add RCU read bindings (#143) and each_process() iterator

6b13c49

Add an RcuReadGuard object for calling functions that expect to be RCU-protected, and add a Rust equivalent of for_each_process and a very light task_struct bindings sufficient to write a simple test/demo.

geofft force-pushed the rcu branch from 10f69c7 to 6b13c49 Compare August 22, 2020 00:59

fbq reviewed Aug 27, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RCU read bindings (#143) and each_process() iterator #250

Add RCU read bindings (#143) and each_process() iterator #250

geofft commented Aug 17, 2020 •

edited

Loading

geofft commented Aug 17, 2020

nelhage Aug 17, 2020

geofft Aug 17, 2020

nelhage Aug 17, 2020

geofft Aug 19, 2020

nelhage Aug 19, 2020

alex left a comment

alex Aug 17, 2020

geofft Aug 17, 2020

nelhage Aug 17, 2020

geofft Aug 17, 2020

geofft Aug 19, 2020

alex Aug 17, 2020

geofft Aug 17, 2020

geofft Aug 19, 2020

geofft commented Aug 17, 2020

nelhage commented Aug 17, 2020

geofft commented Aug 17, 2020 •

edited

Loading

geofft commented Aug 17, 2020

geofft commented Aug 21, 2020

alex commented Aug 21, 2020 via email

alex Aug 21, 2020

geofft Aug 22, 2020

alex Aug 22, 2020

alex Aug 21, 2020

alex Aug 21, 2020

geofft Aug 22, 2020

alex Aug 22, 2020

geofft Aug 22, 2020

alex Aug 22, 2020

fbq Aug 27, 2020

Add RCU read bindings (#143) and each_process() iterator #250

Are you sure you want to change the base?

Add RCU read bindings (#143) and each_process() iterator #250

Conversation

geofft commented Aug 17, 2020 • edited Loading

geofft commented Aug 17, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alex left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geofft commented Aug 17, 2020

nelhage commented Aug 17, 2020

geofft commented Aug 17, 2020 • edited Loading

geofft commented Aug 17, 2020

geofft commented Aug 21, 2020

alex commented Aug 21, 2020 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geofft commented Aug 17, 2020 •

edited

Loading

geofft commented Aug 17, 2020 •

edited

Loading